Joe, Krystin, Rohan
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot datashader bebi103 watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------
try:
    import multiprocess
except ImportError:
    import multiprocessing as multiprocess
import warnings
import numpy as np
import pandas as pd
import bebi103
import iqplot
import scipy
import scipy.stats as st
import holoviews as hv
import holoviews.operation.datashader
hv.extension('bokeh')
import bokeh
bokeh.io.output_notebook()
bebi103.hv.set_defaults()
import numpy.random
rg = numpy.random.default_rng()
import random
import numba
import panel as pn
from scipy.stats import gamma
Attributions: Joe
First we load the data set and make it tidy.
# Download Data and wrangle into tidy format
df = pd.read_csv(os.path.join(data_path, "gardner_mt_catastrophe_only_tubulin.csv"), header=9)
df12 = pd.DataFrame(df['12 uM'])
df12.columns = ['time']
df7 = pd.DataFrame(df['7 uM'])
df9 = pd.DataFrame(df['9 uM'])
df10 = pd.DataFrame(df['10 uM'])
df14 = pd.DataFrame(df['14 uM'])
con = [12,7,9,10,14]
frames = [df12, df7, df9, df10, df14]
for i, df_c in enumerate(frames):
    # Rename the single concentration column and tag each row
    df_c.columns = ['time']
    df_c["concentration"] = con[i]
df = pd.concat(frames)
df = df.dropna()
df = df.sort_values('concentration')
df
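The column-by-column wrangling above can also be done in one step with `pd.melt`. A minimal sketch on a toy frame (the column names stand in for the '7 uM', '9 uM', ... headers of the real CSV):

```python
import pandas as pd

# Toy wide-format frame standing in for the Gardner data
wide = pd.DataFrame({
    "7 uM": [100.0, 200.0],
    "12 uM": [150.0, 250.0],
})

# Melt every concentration column into a tidy (concentration, time) frame
tidy = wide.melt(var_name="concentration", value_name="time").dropna()

# Strip the ' uM' suffix so the column is numeric, matching the loop above
tidy["concentration"] = (
    tidy["concentration"].str.replace(" uM", "", regex=False).astype(int)
)
print(tidy)
```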
Now that the data is tidy, we can create some visualizations to better understand how microtubule catastrophe times vary across tubulin concentrations. We do this with a stripbox plot and an ECDF of the catastrophe times for each concentration.
p = iqplot.stripbox(
    data=df,
    q='time',
    cats='concentration',
    x_axis_label='Time (s)',
    y_axis_label='Concentration (uM)',
    width=500,
    height=500,
    jitter=True,
    top_level="box",
    whisker_kwargs=dict(line_color='#00000f', line_width=0.8),
    box_kwargs=dict(line_color='#00000f', line_width=0.8),
    median_kwargs=dict(line_color='#00000f', line_width=0.8),
    marker_kwargs=dict(alpha=0.3),
)
p1 = iqplot.ecdf(
    data=df,
    q='time',
    cats=['concentration'],
    kind='colored',
    conf_int=True,
)
bokeh.io.show(p)
bokeh.io.show(p1)
Above we have plotted a stripbox plot and the ECDF of the distribution of catastrophe times, separated by total tubulin concentration. From these plots, the concentration of tubulin does not appear to strongly affect the time to catastrophe. This is best seen in the ECDF plot, where the distributions of catastrophe times for the different concentrations are nearly identical.
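One way to put a number on "nearly identical" is a two-sample Kolmogorov-Smirnov test between pairs of concentrations. A minimal sketch on simulated data (the Gamma parameters and sample sizes here are hypothetical; the real comparison would use pairs of concentration columns from df):

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3252)

# Two synthetic "catastrophe time" samples drawn from the same Gamma
# distribution, standing in for two tubulin concentrations
sample_a = rng.gamma(2.5, 1 / 0.005, size=200)
sample_b = rng.gamma(2.5, 1 / 0.005, size=200)

# Two-sample KS test: a small statistic / large p-value is consistent
# with the two samples sharing one underlying distribution
result = st.ks_2samp(sample_a, sample_b)
print(result.statistic, result.pvalue)
```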
Now, we will use our work from HW 8.2 to create two different potential generative models for the distribution of catastrophe times for microtubules at a 12 uM tubulin concentration.
def log_like_iid_gamma(params, n):
    """Log likelihood for i.i.d. Gamma measurements, parametrized
    by alpha, beta."""
    alpha, beta = params
    if alpha <= 0 or beta <= 0:
        return -np.inf
    return np.sum(st.gamma.logpdf(n, alpha, loc=0, scale=1/beta))
def mle_iid_gamma(n):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    gamma measurements, parametrized by alpha, beta"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        res = scipy.optimize.minimize(
            fun=lambda params, n: -log_like_iid_gamma(params, n),
            x0=np.array([1, 1/2]),
            args=(n,),
            method='Powell',
        )
    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)
def gen_fun_gamma(params, n, size, rg):
    """Draw a parametric bootstrap sample from the Gamma model."""
    alpha, beta = params
    return st.gamma.rvs(alpha, loc=0, scale=1/beta, size=size)
data = df.loc[df['concentration'] == 12, 'time']
gamma_mle = mle_iid_gamma(data)
bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
    mle_iid_gamma,
    gen_fun_gamma,
    data,
    gen_args=(data, ),
    size=1000,
    n_jobs=3,
    progress_bar=True,
)
conf = np.percentile(bs_reps, [2.5, 97.5], axis=0)
alpha_g = gamma_mle[0]
beta_g = gamma_mle[1]
print('''alpha
MLE: {}
95% Confidence Interval {}
'''.format(gamma_mle[0], conf[:, 0]))
print('''beta
MLE: {}
95% Confidence Interval {}
'''.format(gamma_mle[1], conf[:, 1]))
def log_like_iid_exp(params, t):
    """Log likelihood for i.i.d. measurements of the sum of two
    exponentials (two successive Poisson processes), parametrized
    by beta_1, beta_2."""
    beta_1, beta_2 = params
    dB = beta_2 - beta_1
    if beta_1 <= 0 or beta_2 <= 0:
        return -np.inf
    if dB < 0:
        return -np.inf
    # If beta_1 ~ beta_2, use the limiting PDF with just beta_1
    if np.abs(beta_1 - beta_2) < 1e-4:
        pdf = beta_1**2 * t * np.exp(-beta_1 * t)
    else:
        c = beta_1 * (beta_1 + dB) / dB
        pdf = c * np.exp(-beta_1 * t) * (1 - np.exp(-dB * t))
    return np.sum(np.log(pdf))
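As a sanity check on the near-equal-rates branch above, the general two-rate PDF should approach the limiting form $\beta^2 t e^{-\beta t}$ as $\beta_2 \to \beta_1$. A minimal sketch with a hypothetical rate:

```python
import numpy as np

def two_step_pdf(t, beta_1, beta_2):
    """PDF of the time for two successive Poisson arrivals (beta_1 != beta_2)."""
    dB = beta_2 - beta_1
    c = beta_1 * (beta_1 + dB) / dB  # = beta_1 * beta_2 / (beta_2 - beta_1)
    return c * np.exp(-beta_1 * t) * (1 - np.exp(-dB * t))

t = np.linspace(1.0, 1000.0, 50)
beta = 0.005  # hypothetical rate, 1/s

# With beta_2 just above beta_1, the general PDF should match the
# limiting form beta**2 * t * exp(-beta * t) used in the code above
general = two_step_pdf(t, beta, beta + 1e-9)
limiting = beta**2 * t * np.exp(-beta * t)
print(np.max(np.abs(general - limiting)))
```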
def mle_iid_exp(times):
    """Perform maximum likelihood estimates for parameters for i.i.d.
    two-step Poisson measurements, parametrized by beta_1, beta_2"""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        res = scipy.optimize.minimize(
            fun=lambda params, times: -log_like_iid_exp(params, times),
            x0=np.array([0.01, 0.011]),
            args=(times,),
            method='Powell',
        )
    if res.success:
        return res.x
    else:
        raise RuntimeError('Convergence failed with message', res.message)
mle_exp = mle_iid_exp(np.array(data))
def gen_fun_exp(params, n, size, rg):
    """Draw a parametric bootstrap sample: sum of two exponentials."""
    beta_1, beta_2 = params
    b_1 = rg.exponential(1/beta_1, size)
    b_2 = rg.exponential(1/beta_2, size)
    return b_1 + b_2
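A quick check on this generator: the sum of two independent exponentials has mean $1/\beta_1 + 1/\beta_2$, so the bootstrap samples it draws should average to that. A sketch with hypothetical rates:

```python
import numpy as np

rng = np.random.default_rng(3252)
beta_1, beta_2 = 0.004, 0.006  # hypothetical rates, 1/s

# Mimic gen_fun_exp: draw the two exponential waiting times and add them
samples = (
    rng.exponential(1 / beta_1, size=100_000)
    + rng.exponential(1 / beta_2, size=100_000)
)

expected_mean = 1 / beta_1 + 1 / beta_2
print(samples.mean(), expected_mean)
```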
bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
    mle_iid_exp,
    gen_fun_exp,
    data,
    gen_args=(data, ),
    size=2000,
    n_jobs=3,
    progress_bar=True,
)
conf_exp = np.percentile(bs_reps, [2.5, 97.5], axis=0)
beta1_exp = mle_exp[0]
beta2_exp = mle_exp[1]
print('''beta_1
MLE: {}
95% Confidence Interval {}
'''.format(mle_exp[0], conf_exp[:, 0]))
print('''beta_2
MLE: {}
95% Confidence Interval {}
'''.format(mle_exp[1], conf_exp[:, 1]))
Attributions below: Krystin
To compare the two models, we use predictive ECDFs to determine whether the MLE parameters give a good representation of the true generative model, plotting how data generated from both models look. We used parametric bootstrapping to generate data sets, then plotted confidence intervals of the resulting ECDFs with the measured ECDF overlaid, using predictive_ecdf from the bebi103 package; the plots are shown below. We also plot the difference between the predictive ECDF and the measured ECDF, which makes it easier to see which model fits best.
times = data
rg = np.random.default_rng()
gamma_samples = np.array(
    [rg.gamma(alpha_g, 1 / beta_g, size=len(data)) for _ in range(1000)]
)
exp_samples = np.array(
    [
        rg.exponential(1 / beta1_exp, size=len(data))
        + rg.exponential(1 / beta2_exp, size=len(data))
        for _ in range(1000)
    ]
)
# Gamma
p1 = bebi103.viz.predictive_ecdf(
    samples=gamma_samples, data=times, discrete=True, x_axis_label="time (s)"
)
bokeh.io.show(p1)
# Gamma, difference from measured ECDF
p1 = bebi103.viz.predictive_ecdf(
    samples=gamma_samples, data=times, diff='ecdf', discrete=True, x_axis_label="time (s)"
)
bokeh.io.show(p1)
For the figures above, the lighter blue band contains the middle 95% of the generated ECDFs, while the darker blue captures the middle 68%. From these plots, we can see that most of the measured ECDF lies within the 95% band of the Gamma-generated ECDFs.
# Exponential
p2 = bebi103.viz.predictive_ecdf(
    samples=exp_samples, data=times, discrete=True, x_axis_label="time (s)", color="red",
)
bokeh.io.show(p2)
# Exponential, difference from measured ECDF
p2 = bebi103.viz.predictive_ecdf(
    samples=exp_samples, data=times, diff='ecdf', discrete=True, x_axis_label="time (s)", color="red",
)
bokeh.io.show(p2)
For the figures above, it is apparent that the two-successive-Poisson-processes generative model does not fit the data as well, since many of the points lie outside the 95% confidence band.
p1 = bebi103.viz.predictive_ecdf(
    samples=gamma_samples, data=times, discrete=True, x_axis_label="time (s)"
)
p2 = bebi103.viz.predictive_ecdf(
    samples=exp_samples, data=times, discrete=True, x_axis_label="time (s)", p=p1, color="red",
)
bokeh.io.show(p1)
The above plot shows the predictive CDFs of the gamma distribution (yellow) and the sum of two successive Poisson arrivals (red), along with the ECDF of the catastrophe times for microtubules at a 12 uM tubulin concentration. The gamma distribution more closely resembles the distribution of catastrophe times than the successive-Poisson model does. Therefore, we will use the gamma distribution to model the catastrophe times of microtubules moving forward.
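As a quantitative complement to the graphical comparison (not part of the original analysis), the two models can also be scored with the Akaike information criterion; since both have two parameters, this reduces to comparing maximized log-likelihoods. A sketch on simulated data with hypothetical parameter values standing in for the MLEs:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3252)

# Simulated "catastrophe times", standing in for the 12 uM data
# (hypothetical Gamma parameters)
data = rng.gamma(2.5, 1 / 0.005, size=150)

def aic(log_like, n_params):
    """Akaike information criterion: lower is better."""
    return -2 * log_like + 2 * n_params

# Gamma log-likelihood (in a real analysis, evaluate at the MLEs)
ll_gamma = np.sum(st.gamma.logpdf(data, 2.5, scale=1 / 0.005))

# Two-step Poisson model log-likelihood at hypothetical rates
b1, b2 = 0.004, 0.006
pdf = b1 * b2 / (b2 - b1) * (np.exp(-b1 * data) - np.exp(-b2 * data))
ll_exp = np.sum(np.log(pdf))

print(aic(ll_gamma, 2), aic(ll_exp, 2))
```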
Code: Joe
mles = []
for i, concentration in enumerate([7, 9, 10, 12, 14]):
    data = pd.DataFrame(df.loc[df['concentration'] == concentration, 'time'])
    gamma_mle = mle_iid_gamma(data)
    bs_reps = bebi103.bootstrap.draw_bs_reps_mle(
        mle_iid_gamma,
        gen_fun_gamma,
        data,
        gen_args=(data, ),
        size=1000,
        n_jobs=3,
        progress_bar=False,
    )
    conf = np.percentile(bs_reps, [2.5, 97.5], axis=0)
    mles.append(gamma_mle)
    print('Concentration: {} uM'.format(concentration))
    print('''alpha
MLE: {}
95% Confidence Interval {}
'''.format(gamma_mle[0], conf[:, 0]))
    print('''beta
MLE: {}
95% Confidence Interval {}
'''.format(gamma_mle[1], conf[:, 1]))
Krystin:
Looking at the values of the parameters versus tubulin concentration, we can see that $\alpha$, the number of Poisson arrivals required for microtubule catastrophe, generally increases, with the exception of 12 uM, where the calculated $\alpha$ and its confidence interval are smaller than those for 10 uM. Meanwhile, $\beta$, the rate of the arrival process, is largest at 10 uM and decreases toward the tails at 7 uM and 14 uM. It is difficult to say conclusively which tubulin concentration yields the fastest and most frequent MT catastrophes, but we can say that when tubulin concentration is at least 10 uM, there are more occurrences of MT catastrophe. This could mean that there is a threshold tubulin concentration that microtubules can utilize for polymerization.
%load_ext watermark
%watermark -v -p numpy,pandas,jupyterlab